Identifying Semantically Deviating Outlier Documents
Abstract
A document outlier is a document that deviates substantially in semantics from the majority of documents in a corpus. Automatic identification of document outliers can be valuable in many applications, such as screening health records for medical mistakes. In this paper, we study the problem of mining semantically deviating document outliers in a given corpus. We develop a generative model that identifies frequent and characteristic semantic regions in the word embedding space to represent the given corpus, and a robust outlierness measure that is resistant to noisy content in documents. Experiments conducted on two real-world textual data sets show that our method achieves up to a 135% improvement over baselines in terms of recall at the top 1% of the outlier ranking.
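The evaluation metric named in the abstract, recall at the top 1% of the outlier ranking, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name `recall_at_top_percent`, the convention that a higher score means more outlying, and rounding the cut-off up to a whole document are all assumptions made for illustration.

```python
import math


def recall_at_top_percent(scores, is_outlier, percent=1.0):
    """Share of true outliers found in the top `percent`% of the ranking.

    Assumptions (not taken from the paper): higher score = more outlying,
    and the cut-off is rounded up to at least one document.
    """
    if len(scores) != len(is_outlier):
        raise ValueError("scores and is_outlier must have the same length")
    n = len(scores)
    total_outliers = sum(1 for flag in is_outlier if flag)
    if n == 0 or total_outliers == 0:
        return 0.0
    # Number of documents inspected at the top of the ranking.
    k = max(1, math.ceil(n * percent / 100.0))
    # Document indices ordered from most to least outlying.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    hits = sum(1 for i in ranked[:k] if is_outlier[i])
    return hits / total_outliers


# Toy corpus of 200 documents: the top 1% is the 2 highest-scoring documents.
scores = list(range(200))
labels = [i in (0, 5, 198, 199) for i in range(200)]
print(recall_at_top_percent(scores, labels, percent=1.0))  # 2 of 4 outliers found -> 0.5
```

A relative improvement over a baseline would then be the difference between the two recall values divided by the baseline's recall.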
Similar resources
Semantic Tagging of Domain-Specific Text Documents with DIAsDEM
Large volumes of electronically available information are stored in textual form. The extraction of semantics from these documents and the characterization of their contents into a database-like schema is a necessary prerequisite for efficient search and for the fusion of documents semantically belonging together, be they documents about the same company, police reports or legal attests related ...
OutRules: A Framework for Outlier Descriptions in Multiple Context Spaces
Analyzing exceptional objects is an important mining task. It includes the identification of outliers but also the description of outlier properties in contrast to regular objects. However, existing detection approaches miss to provide important descriptions that allow human understanding of outlier reasons. In this work we present OutRules, a framework for outlier descriptions that enable an e...
Comparative Analysis of Outlier Detection Techniques
Data Mining simply refers to the extraction of very interesting patterns of the data from the massive data sets. Outlier detection is one of the important aspects of data mining which actually finds out the observations that are deviating from the common expected behavior. Outlier detection and analysis is sometimes known as outlier mining. In this paper, we have tried to provide the broad and ...
Outlier Document Filtering Applied to the Extractive Summarization
Summarization requires selection of the more informative sentences within a set of documents. Generally, process assumes the document set includes related topics to a subject. However, some of the documents may be outlier and the effect of an outlier document might affect the success of extractive summary. Research is focused on filtering documents at the extraction stage these are outlier. Ext...
DeepDetect: An Extensible System for Detecting Attribute Outliers & Duplicates in XML
XML, the eXtensible Markup Language, is fast evolving into the new standard for data representation and exchange on the WWW. This has resulted in a growing number of data cleaning techniques to locate “dirty” data (artifacts). In this paper, we present DeepDetect – an extensible system that detects attribute outliers and duplicates in XML documents. Attribute outlier detection finds objects tha...
Publication date: 2017